Text Categorization with Support Vector Machines: Learning with Many Relevant Features

نویسنده

  • Thorsten Joachims
چکیده

This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs arc appropriate for this task. Empirical results support the theoretical findings. SVMs achieve substantial improvements over the currently best performing methods and behave robustly over a variety of different learning tasks. Furthermore, they are fully automatic, eliminating the need for manual parameter tuning. 1 I n t r o d u c t i o n With the rapid growth of online information, text categorization has become one of the key techniques for handling and organizing text data. Text categorization techniques are used to classify news stories, to find interesting information on the WWW, and to guide a user's search through hypertext. Since building text classifiers by hand is difficult and time-consuming, it is advantageous to learn classifiers from examples. In this paper I will explore and identify the benefits of Support Vector Machines (SVMs) for text categorization. SVMs are a new learning method introduced by V. Vapnik et al. [9] [1]. They are well-founded in terms of computational learning theory and very open to theoretical understanding and analysis. After reviewing the standard feature vector representation of text, I will identify the particular properties of text in this representation in section 4. I will argue that SVMs are very well suited for learning in this setting. The empirical results in section 5 will support this claim. Compared to state-of-the-art methods, SVMs show substantial performance gains. Moreover, in contrast to conventional text classification methods SVMs will prove to be very robust, eliminating the need for expensive parameter tuning. 2 T e x t C a t e g o r i z a t i o n The goal of text categorization is the classification of documents into a fixed number of predefined categories. Each document can be in multiple, exactly one, or no category at all. Using machine learning, the objective is to learn classifiers

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Universit at Dortmund Fachbereich Informatik Lehrstuhl Viii K Unstliche Intelligenz Text Categorization with Support Vector Machines: Learning with Many Relevant Features Text Categorization with Support Vector Machines: Learning with Many Relevant Features

This paper explores the use of Support Vector Machines (SVMs) for learning text classiers from examples. It analyzes the particular properties of learning with text data and identi es, why SVMs are appropriate for this task. Empirical results support the theoretical ndings. SVMs achieve substantial improvements over the currently best performing methods and they behave robustly over a variety o...

متن کامل

Text Categorization with Support Vector Machines: Learning with Many Relevant F Eatures Text Categorization with Support Vector Machines: Learning with Many Relevant F Eatures

This paper explores the use of Support Vector Machines (SVMs) for learning text classiers from examples. It analyzes the particular properties of learning with text data and identi es, why SVMs are appropriate for this task. Empirical results support the theoretical ndings. SVMs achieve substantial improvements over the currently best performing methods and they behave robustly over a variety o...

متن کامل

Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA

With the explosive growth in amount of information, it is highly required to utilize tools and methods in order to search, filter and manage resources. One of the major problems in text classification relates to the high dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of features space. There are many feature selection methods. However...

متن کامل

An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification

Due to the exponential growth of electronic texts, their organization and management requires a tool to provide information and data in search of users in the shortest possible time. Thus, classification methods have become very important in recent years. In natural language processing and especially text processing, one of the most basic tasks is automatic text classification. Moreover, text ...

متن کامل

Support Vector Machines for Text Categorization Based on Latent Semantic Indexing

Text Categorization(TC) is an important component in many information organization and information management tasks. Two key issues in TC are feature coding and classifier design. In this paper Text Categorization via Support Vector Machines(SVMs) approach based on Latent Semantic Indexing(LSI) is described. Latent Semantic Indexing[1][2] is a method for selecting informative subspaces of featu...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1998